Abstract

This paper describes our system for the KBA Track at TREC 2012, which comprises preprocessing,
index building, relevance feedback and similarity calculation. In particular, the Jaccard coefficient
was applied to calculate the similarities between documents. We also present the evaluation results
for our team and compare them with the best and median results.

1.  Introduction

Knowledge Base Acceleration (KBA) seeks to help humans expand knowledge bases like
Wikipedia by automatically recommending edits based on incoming content streams. In its first
year at TREC, the track evaluates systems on a single, simple task called cumulative citation
recommendation (CCR): filter a stream of content for information that should be linked from a given
Wikipedia page or a specific entity.


Figure  1.  The  framework  of  KBA  system 


Figure 1 shows the framework of our KBA system. First, we focused on the “cleansed”
and “NER” parts of the corpus. Preprocessing filtered out useless documents and information
and built the Indri index of the remaining corpus. Secondly, relevance feedback was conducted to
expand the query information. Three expanded terms were generated for each entity. We used
these terms to query the index and obtained an initial set of candidate relevant documents to be
recommended. Finally, we utilized a variation of the Jaccard coefficient to calculate the
similarities between documents and generated the final recommended documents according to a
threshold.


2.  Preprocessing  and  Index  Building 

To support the subsequent steps, we need to preprocess the original corpus and build an index
for the retrieval system.

After decrypting the corpus with standard gpg and decompressing it with XZ, we obtain the original
data collected from Wikipedia. The corpus is split into three components: linking, social
and news. We focused only on the documents labeled “cleansed” and “ner”, and extracted the essential
parts for index building. The following text processing procedures were then applied to these documents:

•  Deleting non-English text

•  Lowercasing capital letters

•  Removing external links inside the text

•  Expanding abbreviations

•  Removing unnecessary punctuation
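
The text processing steps listed above can be sketched roughly as follows. This is a minimal illustration rather than the code used in our system: the ASCII-ratio language check, the abbreviation table and all function names are assumptions introduced here for the example.

```python
import re
import string

# Hypothetical abbreviation table; the actual table used is not given in the paper.
ABBREVIATIONS = {"u.s.": "united states", "dr.": "doctor"}

URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def is_mostly_english(text, min_ascii_ratio=0.9):
    """Crude language check (an assumption): share of ASCII characters."""
    if not text:
        return False
    ascii_chars = sum(1 for ch in text if ch.isascii())
    return ascii_chars / len(text) >= min_ascii_ratio


def clean_text(text):
    """Apply the listed steps in order; return None when the text is dropped."""
    if not is_mostly_english(text):
        return None                        # non-English text deletion
    text = text.lower()                    # lowercasing
    text = URL_PATTERN.sub(" ", text)      # removing external links
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)   # abbreviation expansion
    text = text.translate(PUNCT_TABLE)     # removing punctuation
    return " ".join(text.split())          # collapse leftover whitespace
```

Abbreviations are expanded before punctuation is stripped, since the periods are what distinguish an abbreviation such as "u.s." from ordinary words.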

We converted the documents into the “trectext” format used by the Indri toolset for building the index.
Besides the text itself, we kept the “DOCNO”, “stream_id” and “Time” fields. We used a
simple stop word list to help Indri exclude uninformative words. In addition, the Porter algorithm was
used for stemming.
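
As an illustration of the conversion, the sketch below serializes one cleaned document into a trectext-style record. The DOC, DOCNO and TEXT tags follow the usual trectext layout; the STREAMID and TIME tag names are assumptions made for this example, since the exact tags we used for the extra fields are not listed here, and the text is assumed to be already cleaned so that no escaping is needed.

```python
def to_trectext(docno, stream_id, time, text):
    """Serialize one document as a trectext-style record.

    STREAMID and TIME are hypothetical tag names chosen for illustration.
    """
    return (
        "<DOC>\n"
        f"<DOCNO>{docno}</DOCNO>\n"
        f"<STREAMID>{stream_id}</STREAMID>\n"
        f"<TIME>{time}</TIME>\n"
        "<TEXT>\n"
        f"{text}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


if __name__ == "__main__":
    # Made-up identifiers, only to show the shape of the output.
    print(to_trectext("doc-0001", "1325376000-abc", "2012-01-01T00:00:00Z",
                      "sample cleaned text"))
```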


3.  Relevance  Feedback 

KBA uses entities as filter topics for this year’s CCR task. However, querying the index with a
single entity name alone is not enough. To get more information about each topic,
we expanded the topic entity using two kinds of profiles. One is the Wikipedia page of the
entity and the other is the annotation set provided by TREC. From the annotation, we picked out
the documents labeled either ‘R’ (Relevant) or ‘C’ (Central) for each entity.

After  that  we  used  the  following  formula  to  calculate  the  weight  of  each  term: 

$$P_{ml}(t \mid M_d) = \frac{tf(t,d)}{dl_d} \qquad (1)$$

$$P_{avg}(t) = \frac{\sum_{d:\, t \in d} P_{ml}(t \mid M_d)}{df_t} \qquad (2)$$


where $tf(t,d)$ is the raw term frequency of term $t$ in document $d$, $dl_d$ is the total number of tokens
in document $d$, $df_t$ is the document frequency of $t$ and $P_{avg}(t)$ is the weight of each term. We then
set a threshold to keep the three highest-weighted terms as the final expanded query terms. Together with the
initial entity name, these terms were used to query the index and find the candidate similar
documents.
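
A minimal sketch of this weighting is given below, assuming each profile document has already been tokenized, stemmed and stripped of stop words. The function name and the alphabetical tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def expansion_terms(documents, top_k=3):
    """Rank terms by P_avg(t) over the profile documents of one entity.

    documents: list of token lists. Returns the top_k (term, weight) pairs.
    """
    weight_sum = Counter()   # sum over d of P_ml(t | M_d)
    doc_freq = Counter()     # df_t: number of documents containing t
    for tokens in documents:
        if not tokens:
            continue
        term_freq = Counter(tokens)
        doc_len = len(tokens)
        for term, count in term_freq.items():
            weight_sum[term] += count / doc_len   # Eq. (1)
            doc_freq[term] += 1
    p_avg = {t: weight_sum[t] / doc_freq[t] for t in weight_sum}   # Eq. (2)
    # Highest weight first; ties broken alphabetically (an assumption).
    ranked = sorted(p_avg.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

Because Eq. (2) divides by the document frequency, a term that is frequent in a single short document can outrank a term spread thinly over many documents.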


4.  Similarity  Calculation 


To generate the final recommended documents from the candidates above, we
used the Jaccard coefficient to calculate the similarity between each candidate document and the
original Wikipedia page of the topic entity. The Jaccard similarity coefficient is a statistic
for comparing the similarity and diversity of sample sets; it is defined as the size of the
intersection divided by the size of the union of the sample sets. We used a variation of the
traditional Jaccard formula for our specific task, shown as follows:
where wiki and d stand for the Wikipedia page and the candidate document respectively, and tf(t) is
the term frequency of t. The calculated Jaccard coefficient is multiplied by 1000 to give the
final confidence score for each candidate. We then compared the confidence score with the
similarity threshold: if the score is larger than the threshold, the document is recommended.
The threshold was set between 400 and 1000.
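
The scoring and thresholding step can be sketched as follows. Note that this is not our exact formula: the sketch assumes a common term-frequency-weighted generalization of the Jaccard coefficient (sum of minimum frequencies over sum of maximum frequencies), which reduces to the set-based definition when every frequency is 0 or 1. Rounding the score to an integer and all function names are likewise assumptions.

```python
from collections import Counter


def weighted_jaccard(wiki_tokens, doc_tokens):
    """Term-frequency-weighted Jaccard (assumed form): sum(min tf) / sum(max tf)."""
    tf_wiki, tf_doc = Counter(wiki_tokens), Counter(doc_tokens)
    terms = set(tf_wiki) | set(tf_doc)
    if not terms:
        return 0.0
    overlap = sum(min(tf_wiki[t], tf_doc[t]) for t in terms)
    total = sum(max(tf_wiki[t], tf_doc[t]) for t in terms)
    return overlap / total


def confidence_score(wiki_tokens, doc_tokens):
    """Scale the coefficient by 1000; rounding to an integer is an assumption."""
    return round(1000 * weighted_jaccard(wiki_tokens, doc_tokens))


def recommend(wiki_tokens, candidates, threshold=400):
    """Keep candidates whose score is larger than the threshold.

    candidates: dict mapping a document id to its token list.
    """
    scores = {doc_id: confidence_score(wiki_tokens, tokens)
              for doc_id, tokens in candidates.items()}
    return {doc_id: s for doc_id, s in scores.items() if s > threshold}
```

For example, with a Wikipedia page tokenized as `["a", "a", "b"]` and a candidate `["a", "b", "c"]`, the minima sum to 2 and the maxima sum to 4, giving a coefficient of 0.5 and a confidence score of 500, which passes a threshold of 400.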

5.  Evaluation  Results 

We submitted up to 7 runs for this year’s task. Due to limited space, we show only
our best results and a comparison with those of other teams.

Table 1 and Figure 2 show the Precision, Recall, F1 and Scaled Utility of our run. The
F1 measure increases as the cutoff goes down and reaches its peak at a cutoff of 400, below which
it stays constant, whereas the Scaled Utility shows the inverse trend.


Table  1.  Average  performance  of  the  PRIS  run 


cutoff   Precision   Recall     F1         Scaled Utility
0        0.267298    0.05809    0.067795   0.250206
100      0.267298    0.05809    0.067795   0.250206
200      0.267298    0.05809    0.067795   0.250206
300      0.267298    0.05809    0.067795   0.250206
400      0.267298    0.05809    0.067795   0.250206
500      0.210405    0.02739    0.041212   0.292443
600      0.056385    0.005454   0.008731   0.304233
700      0.03046     0.003025   0.005293   0.319105
800      0           0          0          0.325258
900      0           0          0          0.33285

Figure  2.  Average  performance  of  the  PRIS  run 


Table 2. Comparison with the Best, Median and Mean F1 measure at cutoff 400


URL name                         PRIS     Best     Median   Mean
Aharon_Barak                     0.1163   0.3841   0.1909   0.1664
Alex_Kapranos                    0        0.4298   0.2706   0.2263
Alexander_McCall_Smith           0.0832   0.3963   0.1955   0.1593
Annie_Laurie_Gaylor              0.0233   0.5021   0.3046   0.2304
Basic_Element_(company)          0.0952   0.8497   0.1670   0.2714
Basic_Element_(music_group)      0.0104   0.8483   0.0757   0.1238
Bill_Coen                        0.0769   0.4375   0.1984   0.1709
Boris_Berezovsky_(businessman)   0.0015   0.5371   0.4859   0.3503
Boris_Berezovsky_(pianist)       0        0.5714   0.0045   0.0369
Charlie_Savage                   0.0202   0.6846   0.1135   0.1339
Darren_Rowse                     0.1505   0.3271   0.1910   0.1676
Douglas_Carswell                 0        0.5562   0.1352   0.1286
Frederick_M._Lawrence            0.1818   0.7027   0.2684   0.2621
Ikuhisa_Minowa                   0        0.5860   0.5229   0.3749
James_McCartney                  0.0293   0.5757   0.2637   0.2275
Jim_Steyer                       0.0556   0.7419   0.4599   0.3296
Lisa_Bloom                       0.0566   0.6341   0.1302   0.1524
Lovebug_Starski                  0        0.2462   0.1176   0.0913
Mario_Garnero                    0.4930   0.9211   0.7741   0.6095
Masaru_Emoto                     0.1091   0.2      0.1014   0.0843
Nassim_Nicholas_Taleb            0.0056   0.4747   0.3143   0.2578
Rodrigo_Pimentel                 0.0390   0.5385   0.0751   0.1168
Roustam_Tariko                   0.0408   0.4982   0.3634   0.2786
Ruth_Rendell                     0.0132   0.4430   0.3357   0.2304
Satoshi_Ishii                    0.0061   0.6556   0.4239   0.3266
Vladimir_Potanin                 0.0552   0.7508   0.2556   0.2580
William_Cohen                    0        0.3484   0.0816   0.0815
William_D._Cohan                 0.0529   0.6538   0.3458   0.3041
William_H._Gates,_Sr             0.3039   0.3943   0.1803   0.1491
average                          0.0678   0.4263   0.2506   0.2066

Table 2 compares our run with the best, median and mean results on the F1
measure at cutoff 400. The F1 measures of two entities (Masaru_Emoto and
William_H._Gates,_Sr) are higher than the median and mean results; the F1 measures of four
entities (Aharon_Barak, Darren_Rowse, Frederick_M._Lawrence and Mario_Garnero) are
comparable to the median and mean, while the results of the remaining entities are lower than average.

The average F1 value of our run is 0.0678, while the averages of the median and mean results are
0.2506 and 0.2066 respectively, which means that there is still considerable room for improvement.